ABSTRACT

Current practice in educational program evaluation
was examined through analyses of 232 reports submitted, from 1980 to
1983, by institutions seeking approval for their programs from the
U.S. Department of Education's Joint Dissemination Review Panel
(JDRP). The JDRP reviews these reports to determine whether
educational programs have demonstrated that they are effective. This
study also examined how evaluation methods differed for programs
which were approved or not approved by the JDRP during this time
period. Certain features were found more often in programs approved
than in those not approved. These included: (1) the presence of an
independent evaluator affiliated with a research firm; (2) the use of
more than one evaluation design; (3) the absence of obvious errors in
data analyses; and (4) the implementation of evaluation designs of
high quality. Features of the educational programs and their
evaluations were documented through content analyses. Descriptive
profiles were developed for the entire sample, as well as the
subsamples of approved and not-approved programs. Regression analyses
were used to relate differences in evaluation methodology to
differences in the programs' effect size. The appended tables include
summary data on program content, approvals, grade level, evaluators,
outcome measures used, test validity and reliability, evaluation
design, data analysis, and effect size. A 36-item reference list
concludes the document. (MAC)



Practices in Educational Program Evaluation, 1980-1983 

Kathleen Bodisch Lynch, Ph.D. 
Office of Medical Education 
University of Virginia School of Medicine 
Box 382 

Charlottesville, Virginia 22908 



A paper presented at the
American Educational Research Association Annual Meeting

Washington, D.C.
April 1987



Abstract 

Current practice in educational program evaluation was examined
through analyses of reports submitted by educational programs
seeking approval from the U.S. Department of Education's Joint
Dissemination Review Panel (JDRP). The JDRP reviews these reports
to determine whether educational programs have convincingly
demonstrated that they are effective. In this study, features of
the educational programs and their evaluations were documented
through content analyses of 232 reports submitted to the JDRP from
1980 through 1983. Descriptive profiles were developed for the
sample as a whole, as well as for the subsamples of approved and
not-approved programs. Regression analyses were used to relate
differences in evaluation methodology to differences in the size
of educational effects detected by the programs.

Practices in Educational Program Evaluation, 1980-1983



The birth of educational program evaluation as a distinct
field of study in the United States may be traced to the
Elementary and Secondary Education Act (ESEA) of 1965. Included
in this legislation, which mandated a number of federally funded
programs to improve the performance of disadvantaged children, was
the requirement that educational projects be accountable for their
use of federal monies. Congress wanted to know whether large
expenditures of funds helped to bring about any real improvements
in education.

Acting on the federal requirements, educators began to
evaluate their own efforts. Unfortunately, their task was made
complicated both by the absence of clear guidelines from Congress
as to what questions needed to be answered, and by the fact that
program evaluation was not what educators were typically trained
to do. These conditions set the stage for scholars to turn their
attention to the development of theories and models of educational
program evaluation. By 1973, the field had become well enough
established that Worthen and Sanders were able to assemble a book
about the varieties of evaluation approaches, with chapters
contributed by many well-known theoreticians and practitioners in
education, psychology, and other areas of social science research.

The proliferation of approaches to educational program
evaluation spawned by the 1965 Congressional mandate resulted in a
further request by Congress, in the Education Amendments of 1978.
Now, a study of the educational evaluation practices themselves
was desired. In response, the National Academy of Science, at the
invitation of the (then) Office of Education, undertook a study
whose purpose was to recommend ways of increasing the
effectiveness and usefulness of the Office of Education's
evaluation efforts. This study, the results of which were
compiled in a 1981 book edited by Raizen and Rossi, is one of a
very few systematic reviews of the types and quality of
educational evaluations. A more recent effort describing "the
state of the art and the sorry state of the science" of evaluation
was reported by Lipsey, Crosse, Dunkle, Pollard, and Stobart
(1985). The need for documentation of evaluation practice in
real-life settings has long been noted by respected researchers
and practitioners (see, e.g., Boruch & Cordray, 1980; Cook &
Gruder, 1978; Smith, 1979). The improvement of educational
program evaluation depends on the accumulation of knowledge about
the successes and failures of evaluation methods as actually
applied in the classroom.

Background and Purpose of the Study

The primary objective of this study was to document current
practice in educational program evaluation through a systematic
analysis of reports submitted to the U.S. Department of
Education's Joint Dissemination Review Panel (JDRP) during a
four-year period. A secondary purpose was to determine how evaluation
methods differed for programs which were approved or not approved
by the JDRP during this time period.

The JDRP is a body of experts from the Department of
Education whose purpose is to review educational products and
practices from all over the United States in order to determine
whether they are effective. The JDRP makes its judgments through
consideration of 10-page written reports supplied by the programs
seeking JDRP approval. These reports describe the program's
goals, activities, costs, implementation requirements, evaluation
procedures, and evidence of effectiveness. The JDRP reviews the
reports to determine whether educational programs have
convincingly demonstrated that they are effective. Program staff
also make an oral presentation before a subgroup of three to seven
members of the JDRP. Decisions concerning approval or rejection
of an educational program are made on the basis of a simple
majority vote of this subgroup.

The JDRP considers educational products and practices which
it has approved to be worthy of nationwide dissemination.
Consequently, approval is given to a program only if two major
conditions are met: (a) the program must persuasively demonstrate
that it is effective, and (b) the program must be effective to an
exemplary degree. To satisfy the first condition, programs must
document their evaluation procedures and show that the observed
effects are attributable to the program itself, rather than to
other plausible explanatory factors (Datta, 1977; Tallmadge,
1977). In addition, the intervention and its effects should be
replicable.

To satisfy the second condition (i.e., that the degree of
program effectiveness is exemplary), the program must convince the
JDRP that the effects produced are of sufficient magnitude to be
considered statistically and educationally significant. While
statistical significance can be assessed through conventional
techniques of data analysis, the determination of what constitutes
educational significance is more difficult. The JDRP uses a
combination of judgmental and normative approaches (Sechrest &
Yeaton, 1981) for evaluating the educational significance of the
size of effects produced by the programs it reviews. Panel
members, as experts in the field, render professional judgments
about the practical significance of the observed effects. They
also try to determine whether the program has convincingly shown
that the gains produced exceed what is typically reported in the
literature.

Method

During the years 1980 through 1983, 240 educational programs
applied to the JDRP. Of these, 232 specified changes in
knowledge, attitude, or behavior in students, teachers, parents,
paraprofessionals, or other individuals. The other eight programs
targeted institutional change, and were excluded from further
analyses.

A coding form was developed by the author in order to extract
information from the reports submitted to the JDRP. The items and
format chosen were based on a review of the literature on
assessment of the adequacy of research or evaluation reports, and
the literature on meta-analysis (see, e.g., Bernstein & Freeman,
1975; Fang, 1981; Glass, McGaw, & Smith, 1981; Gordon & Morse,
1975; Lipsey, 1983; Sanders & Nafziger, 1976). Data were
collected on characteristics of the educational programs described
in the JDRP submittals, the methods used to evaluate them, the
size of effects produced, and the quality of the written report.
This paper will focus on the evaluation designs, and on the
relationship of different evaluation characteristics to both
effect size and JDRP approval.

The items developed to document the evaluation practice
represented by the educational programs applying to the JDRP
during the years 1980 through 1983 were organized into three
sections: (a) descriptions of the procedures used to measure
program outcomes, (b) descriptions of the evaluation designs, and
(c) descriptions of the methods of data analysis. Each of these
will be discussed. In addition, the procedures used to calculate
effect sizes and to relate differences in effect size to
differences in evaluation characteristics will be described.

Measurement Features

The affiliation of the evaluator of the educational product
or practice for which claims of effectiveness were presented to
the JDRP was noted: whether the evaluator was on the program staff
or was an external consultant. Information was also gathered to
describe the tests or other instruments used to measure the
effects of implementing the educational product or practice.
Included were items on the derivation of the test (whether
published, developed specifically for the project, or adapted from
another test); the types of validity and reliability data reported
in the submittal; and the appropriateness of procedures followed
during test administration. With regard to the last item, in the
case of norm-referenced tests, the date on which the tests were
administered was noted; in order for these test scores to be
interpretable, the tests should have been administered close to
the time of year at which the empirical norms were established
(usually, not more than two weeks on either side of the norming
date). For outcome measures involving treatment and comparison or
control groups, tests should have been administered to both groups
at or near the same time.
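
The norming-date criterion described above can be expressed as a simple date comparison. The following Python sketch is illustrative only: the function name and the example dates are hypothetical, and the 14-day default is taken from the "usually, not more than two weeks" guideline rather than from any fixed JDRP rule.

```python
from datetime import date


def within_norming_window(test_date: date, norming_date: date,
                          max_days: int = 14) -> bool:
    """Return True if the test was given within max_days of the norming date.

    max_days defaults to 14, an assumption based on the two-week guideline.
    Both dates are expected to fall in the same testing season.
    """
    return abs((test_date - norming_date).days) <= max_days


# Hypothetical example: norms established on October 10.
print(within_norming_window(date(1982, 10, 1), date(1982, 10, 10)))  # True
print(within_norming_window(date(1982, 11, 1), date(1982, 10, 10)))  # False
```

The same comparison, applied to the testing dates of a treatment group and its comparison group, would indicate whether both groups were tested at or near the same time.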

Design Features

In order to identify the various types of evaluation designs
reported in the JDRP submittals, a series of descriptive phrases
was listed on the coding form. Any combination of these phrases
could be recorded to describe a particular evaluation design.
Response categories included such characteristics as the number of
groups involved, whether the groups represented no-treatment or
alternate-treatment conditions, whether there was pre- or
posttesting, whether assignment to group was randomized or not,
and whether norms were used for comparison purposes. This
descriptive approach was used in place of assigning labels to the
evaluation designs in order to document as completely and
objectively as possible the evaluation practice represented in the
submittals, without having to force innovative or patched-together
designs into predefined categories.

The internal validity of each design identified was rated on
a scale ranging from very low to very high. Although standards
for judging the quality of research designs are by no means
uniformly agreed upon (Hirschi & Selvin, 1967; McTavish, Brent,
Cleary, & Knudsen, 1975), Campbell and Stanley (1963) and Cook and
Campbell (1979) have outlined threats to validity typically
associated with various approaches to conducting research and
evaluation studies. Their conceptualization is generally well
respected in the field, and was used as a reference point in
assigning the ratings of design quality. For each design, a
judgment was made concerning the degree to which threats to
internal validity could be discounted. A rating of very high was
given when all of the applicable threats to internal validity
could be ruled out, enabling the reviewer to conclude with a
reasonable degree of certainty that it was the educational program
seeking JDRP approval which produced the claimed effects. A
rating of very low was given to designs in which there was a
"fatal flaw," that is, where at least one of the threats to
internal validity could be considered a compelling and plausible
rival explanation for the results obtained. Ratings between very
high and very low were assigned as follows. Evaluation designs
were rated high when all but one or two of the threats to internal
validity could definitely be ruled out, when neither of the
possible threats could be considered a fatal flaw, and when,
overall, the evidence that the program caused the observed effects
was believable. A medium rating was assigned when at least half
of the threats could be ruled out, there were no fatal flaws, and,
overall, the evidence was ambiguous, neither totally convincing
nor totally unconvincing. Designs were rated low when fewer than
half of the threats to internal validity could be ruled out, and
the evidence was not very convincing, but there were no fatal
flaws.
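
The counting portion of this rating scheme can be summarized in a short Python sketch. It is a simplification offered for illustration, not the procedure actually used in the study: the function and parameter names are hypothetical, the reviewer's overall judgment of believability is not modeled, and the tie-breaking order (requiring at least half of the threats to be ruled out before a design can be rated high) is an assumption.

```python
def rate_design(n_threats: int, n_ruled_out: int, fatal_flaw: bool) -> str:
    """Map threat counts to a five-level quality rating (illustrative only).

    n_threats   -- number of applicable threats to internal validity
    n_ruled_out -- number of those threats that can be ruled out
    fatal_flaw  -- True if any threat is a compelling rival explanation
    """
    if not 0 <= n_ruled_out <= n_threats:
        raise ValueError("n_ruled_out must be between 0 and n_threats")
    if fatal_flaw:
        return "very low"
    remaining = n_threats - n_ruled_out
    if remaining == 0:
        return "very high"
    at_least_half = 2 * n_ruled_out >= n_threats
    if remaining <= 2 and at_least_half:  # assumed tie-breaking condition
        return "high"
    if at_least_half:
        return "medium"
    return "low"


print(rate_design(8, 7, False))  # high
print(rate_design(8, 4, False))  # medium
```

In practice the middle ratings also depended on how believable the evidence was overall, so a count-based rule of this kind could serve only as a first approximation.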

Finally, data on the external validity of the evaluation
designs were collected. Any evidence provided in a submittal to
indicate that a program had been successfully replicated was
recorded on the coding form.

Data Analysis Features

In this section, the descriptive and inferential statistical
analyses applied to the outcome data were documented. Whether the
analysis techniques were appropriate for the type of data gathered
was also noted. Additionally, the form in which test results were
reported in the JDRP submittals was recorded, as well as whether
tests of statistical significance were used to identify
differences between groups.

Calculation of Effect Sizes

Effect sizes were obtained by transforming program results
into standard scores for all programs for which the necessary data
were supplied. In the simplest cases, for studies involving a
comparison between a treatment and a no-treatment group, the
difference between the means of the treatment group and the
comparison group was divided by the standard deviation of the
comparison group (Glass, 1977). This allows one to describe the
status of the treatment group by reference to the distribution of
outcome scores which would have been expected in the absence of
any intervention. In cases where results were based on other
evaluation designs, adaptations of the basic effect size formula
were used, following recommendations reported in the literature
(see, for example, Bryant, 1982; Glass, 1980; McGaw & Glass,
1980).
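
The basic effect size formula described above, the difference between group means divided by the comparison group's standard deviation, can be sketched in Python as follows. The scores are invented for illustration, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the JDRP submittals typically reported summary statistics rather than raw scores.

```python
from statistics import mean, stdev


def glass_delta(treatment: list[float], comparison: list[float]) -> float:
    """Effect size: (treatment mean - comparison mean) / comparison SD.

    The comparison group needs at least two scores with nonzero spread.
    """
    return (mean(treatment) - mean(comparison)) / stdev(comparison)


# Hypothetical posttest scores for a treatment and a no-treatment group.
treatment_scores = [12, 14, 16]
comparison_scores = [8, 10, 12]
print(glass_delta(treatment_scores, comparison_scores))  # 2.0
```

An effect size of 2.0 in this invented example would mean that the average treated student scored two comparison-group standard deviations above the comparison-group mean.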

After effect sizes were calculated, they were related to
characteristics of the educational programs and evaluations
through traditional methods of data analysis. In this paper, only
the relationships to evaluation characteristics will be discussed.
Mean effect sizes were calculated for different levels of
categorical variables, Pearson product-moment correlations were
obtained between effect sizes and continuous variables, and
analyses of variance and linear regression were used to identify
the proportion of variance in the distribution of effect sizes
which was accounted for by characteristics of the evaluations.
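
For a single categorical variable, the proportion of variance accounted for can be computed as the between-group sum of squares divided by the total sum of squares (eta squared), which equals the R-squared from a regression on dummy-coded group membership. The Python sketch below illustrates that calculation under stated assumptions: the function name and the data are hypothetical, and the paper does not specify which software or exact procedure was used.

```python
from statistics import mean


def eta_squared(groups: dict[str, list[float]]) -> float:
    """Proportion of total variance explained by group membership.

    groups maps each level of a categorical variable to its effect sizes.
    Raises ZeroDivisionError if all values are identical.
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical effect sizes grouped by a two-level evaluation characteristic.
example = {"level_a": [1, 2, 3], "level_b": [3, 4, 5]}
print(eta_squared(example))  # 0.6
```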

During the four years covered by this study, 165 out of 232
program narratives reviewed by the JDRP (or 71%) provided the data
necessary to calculate effect sizes. Some JDRP reports provided
effectiveness data for more than one content area, target
audience, type of objective, type of outcome measure, type of
evaluation design, and grade level. Effect sizes were computed
separately for each of these variables for each program. Within
programs, effect sizes were then aggregated across grade level.
When more than one outcome measure was used for a program, effect
sizes were also aggregated across tests within each of two types
of outcome measures: published and locally developed.
Consequently, the number of effect sizes retrieved from each
report varied, with a total of 263 effect sizes obtained.
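
The aggregation step can be pictured as grouping the separately computed effect sizes by program and by type of outcome measure, then combining each group into a single value. The Python sketch below assumes that combining meant taking an unweighted mean, which the paper does not state; the record layout, program identifiers, and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_effect_sizes(
    records: list[tuple[str, str, float]]
) -> dict[tuple[str, str], float]:
    """Average effect sizes within each (program, measure type) pair.

    Each record is (program_id, measure_type, effect_size), with one record
    per grade level or test. Unweighted averaging is an assumption.
    """
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for program_id, measure_type, effect_size in records:
        grouped[(program_id, measure_type)].append(effect_size)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical records: program P1 reports two published-test results
# (e.g., two grade levels) and one locally developed test result.
records = [
    ("P1", "published", 0.5),
    ("P1", "published", 0.7),
    ("P1", "locally developed", 1.2),
    ("P2", "published", 0.4),
]
for key, value in aggregate_effect_sizes(records).items():
    print(key, round(value, 2))
```

Under this scheme a single report can contribute more than one effect size, which is consistent with 263 effect sizes having been obtained from 165 reports.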

Results 

Description of the Educational Programs 

The author summarized data from 232 reports submitted to the
JDRP from January 1, 1980 through December 31, 1983. The number
of submittals reviewed in each of these years ranged from 45 to
68. Sixty-two percent of all the submittals were approved, with
percentages by year varying from 57% to 69%, as can be seen in
Table 1.



Insert Table 1 about here 



A wide variety of content areas were addressed in the
submittals, and some reported on educational programs in more than
one area; a total of 326 programs were described. Over 50% of the
programs had objectives related to reading or math. The next most
frequent content areas were, in order, special education, career
education, language arts, natural science, social science, and
health/physical education; 39% of the programs could be classified
into these categories. Table 2 presents a complete listing of the
content areas addressed in the submittals.



Insert Table 2 about here 



In almost all cases (n = 304) the target audience for the
programs was students, ranging from preschool through graduate
school. For a few programs, the target audience was teachers or
administrators, adult learners, or parents. Table 3 presents the
distribution of programs across the various grade levels. As
might be expected, most of the programs were developed for
school-aged children (K through grade 12), with more efforts occurring in
the elementary and middle schools (K through grade 6) than in the
junior and senior high schools (grades 7 through 12).



Insert Table 3 about here 

The types of objectives that the educational programs
addressed were categorized as being either cognitive, behavioral,
or attitudinal/affective. Almost every submittal (n = 224, or
97%) presented at least one objective in the cognitive domain,
with behavioral and attitudinal objectives occurring much less
frequently (21% and 16% of all submittals, respectively). While
very few programs were designed to effect only behavioral (n = 4)
or only attitudinal (n = 2) changes, programs with only cognitive
objectives were common (n = 157, or 68% of all submittals). The
frequencies presented in Table 4 show the number of times the
different types of objectives and combinations of objectives were
addressed in the submittals reviewed for this study.



Insert Table 4 about here 



Description of the Program Evaluations

Evaluators. Although the JDRP does not require submittals to
identify the evaluators of their programs, over 75% of the
submittals (n = 176) reviewed during the time period of this study
provided such information. For a small group of the submittals,
program staff had the sole responsibility for program evaluation
efforts. In over half of the cases, however, independent
evaluators conducted the evaluations, either alone or in
combination with program staff or other types of evaluators, such
as program developers or representatives from district research
and evaluation offices. Of the independent evaluators, most were
identified as having academic affiliations, the rest being
associated with research or consulting firms. Table 5 displays
the data concerning evaluators' affiliations retrieved from the
JDRP submittals.



Insert Table 5 about here 



Outcome measures. Data were collected on all instruments
which measured outcomes for which claims of effectiveness were
made. The three types of objectives (cognitive, behavioral, and
attitudinal/affective) were included. While the most common case
(n = 113, or 49% of the submittals) was for programs to base their claims of
effectiveness on data from a single outcome measure, the number of
instruments described in each submittal varied from 0 to 7. In
some submittals, only one instrument was used to measure all
program outcomes (e.g., reading and math scores from the same
standardized test). In other submittals, more than one instrument
was used to measure a single program outcome (e.g., scores from
both published and locally developed tests of math achievement).
A total of 438 outcome measures were described in the 232 JDRP
submittals.

The type of instruments most frequently chosen to measure
program outcomes (n = 250, or 57% of all instruments reported)
were published tests such as the MAT (Metropolitan Achievement
Tests), CAT (California Achievement Tests), and SAT (Scholastic
Aptitude Tests). Twenty-nine percent of the instruments used (n =
129) were locally developed outcome measures created for specific
programs. In a few cases, instruments were adopted from other
sources but modified to make them more relevant to the program
being evaluated. Concerning test administration, in evaluations
involving treatment and comparison groups, in 99% of the cases
tests were given to both groups at the same time; in evaluations
based on a norm-referenced design, tests were administered at the
appropriate norming times in only 70% of the cases. Table 6
presents data about these and other features of the outcome
measures described in the JDRP submittals.



Insert Table 6 about here 



The amount of information which was provided about the
validity and reliability of the outcome measures varied. For the
majority of the instruments, at least one type of validity and one
type of reliability was reported (63% and 62%, respectively). In
a much smaller percentage of cases, more than one type of validity
or reliability was reported. The type of validity most frequently
reported was content validity (53% of the instruments), and the
type of reliability most frequently reported was internal
consistency (38%). It should be noted that in many cases, the
only statement made with regard to the validity or reliability of
the instrument was that "the manual stated that these were high."
Moreover, for about one-fourth of the instruments, no information
at all was presented as to their validity or reliability.

Evaluation designs. Data were collected on each evaluation
design used to gather evidence to substantiate the claims of
effectiveness made in each submittal. Recall that a submittal
could describe one or more educational programs, and each program
could have one or more types of objectives. Each of these
objectives, in turn, could be measured by one or more instruments,
and each of these instruments could have been administered in
accordance with the requirements of a different evaluation design.
For example, in many cases a norm-referenced evaluation design was
used with a published achievement test and a nonequivalent control
group design was used with a locally developed test, in order to
provide complementary evidence that a particular objective had
been achieved. In the 232 JDRP submittals reviewed, a total of
363 evaluation designs were reported.

The types of evaluation designs described in the submittals
were documented on the coding forms by using combinations of
descriptive phrases. Table 7 presents the frequencies of
occurrence of the designs based on the coding form categories.



Insert Table 7 about here 



The evaluation design most frequently employed by the programs was
the nonrandomized pre-post comparison group design, referred to as
the nonequivalent control group design by Campbell and Stanley
(1963). A total of 87 submittals (or 38%) reported using this
design. Next most frequently used (n = 67, or 29%) was the
norm-referenced design, which involves pre-post comparisons of scores
based on published norms (Tallmadge & Wood, 1978). Following this
in frequency was the nonrandomized post-only comparison group
design (n = 57, or 25%), that is, the nonequivalent control group design
without the pretest scores. The randomized pre-post control
design (one of Campbell and Stanley's "true experimental" designs)
and the one-group pre-post design (a "pre-experimental" design)
occurred with almost equal frequency (n = 30 or 13%, and n = 33 or
14%, respectively), while all other types of designs occurred fewer
than 10 times each.

When categories of designs from the coding form with similar
characteristics were aggregated, the frequencies presented in
Table 8 resulted. Quasi-experimental designs, which include the
nonequivalent control group, simple time series, and other designs
all characterized by nonrandomized assignment of subjects to
treatment or comparison groups, were by far the designs most
frequently reported in the submittals (n = 170, or 73%). Next in
frequency were the norm-referenced designs (n = 88, or 38%). True
experimental (randomized assignment to groups) and
pre-experimental (one-group, non-time series) designs were found
equally in the submittals (n = 50, or 22% each), and only one
qualitative design was reported in support of claims of
effectiveness. The type of evaluation design used appeared to
have little influence on whether or not an educational program
received JDRP approval; as can be seen in Table 8, the different
types of evaluation designs had similar rates of approval.



Insert Table 8 about here

The quality of each evaluation design was rated on a scale
from very low to very high. The results are presented in Table 9.



Insert Table 9 about here 



The ratings which were assigned tended to cluster in the medium
and high categories, which represented situations where the
evidence of effectiveness was ambiguous (medium quality) or
reasonably convincing (high quality). Overall, more than half of
the evaluation designs (i.e., 56%) were rated as medium or worse,
indicating a failure to produce reasonably believable evidence
that the educational program was responsible for producing the
observed changes in the group receiving the educational product
or practice. Ratings indicating certainty that effectiveness was
not demonstrated (very low) exceeded those indicating certainty
that it was demonstrated (very high) by a ratio of over 2:1. However,
when frequencies of ratings were aggregated across the categories
of very low and low, and very high and high, this ratio reversed
itself; many more evaluations provided convincing evidence of
effectiveness (n = 159, or 44%) than provided convincing evidence
that the program was not effective (n = 69, or 19%). JDRP
approval was given a greater proportion of the time to submittals
with evaluation designs rated as high or very high (79%) than to
those rated either medium (69%) or low and very low (32%).

The distribution of the quality ratings was also broken down
according to categories of designs in order to identify
differences in rated quality dependent on design type. The true
experimental designs received greater proportions of very high and high
ratings than any of the other designs, and these proportions far
exceeded those for the sample as a whole. Conversely, true
experimental designs had smaller proportions of low or very low
ratings than any of the other designs, as well as the overall
sample. Quasi-experimental designs received the next best
distribution of ratings, followed by the norm-referenced designs.
One-group designs had the poorest ratings, receiving
disproportionate amounts on both the high and low ends of the
quality continuum. Table 10 presents summary data on design type
and quality ratings which illustrate these relationships. Mean
ratings for each category of design are also presented, not so
much because they have an inherent significance, but rather
because they help to convey the order with which the different
types of evaluation designs were arrayed along the quality
dimension (when quality is defined as the extent to which threats
to internal validity can be ruled out).



Insert Table 10 about here

All submittals provided some evidence of the external
validity of their programs by presenting data indicating that the
intervention was effective for more than one instructor,
classroom, grade level, school, setting, or time period.
Sometimes results were reported separately for these different
variables, and for different levels within the variables (e.g.,
for grades 6, 7, and 8 in schools A and B). Such presentations
were more clearly indicative of a program's replicability than
those in which the program was implemented across variables or
levels but the data were reported in aggregated form only (e.g.,
reporting one combined mean for grades 6, 7, and 8). The JDRP
approval rate was higher for submittals in which non-aggregated
replication data were presented for at least one variable than
for those submittals reporting only aggregated replication data
(65% vs. 49%, respectively).

Description of the Data Analysis Procedures 

The methods used to analyze the data collected to
substantiate claims of effectiveness ranged from descriptive
statistics to complex multiple regression analyses. Besides
descriptive statistics, the type of data analysis reported most
frequently was the t-test (n = 147, or 63% of the submittals).
Analysis of covariance (ANCOVA), analysis of variance (ANOVA), and
nonparametric statistics were the next most frequently chosen
analytical methods (see Table 11). All of the submittals reported
the use of tests of statistical significance, with the exception
of five submittals which provided descriptive statistics only. Of
this latter group, only one was not approved by the JDRP.



Insert Table 11 about here 



The adequacy of the procedures used to analyze the evaluation
data was examined by identifying features which might negatively
affect the believability and interpretability of the data. For 79
(or 34%) of the submittals reviewed, no problems in the data
analyses were noted. In the other submittals, problems included
the use of inappropriate or inadequate analysis procedures (n =
71, or 31%), omission of some relevant outcome data (n = 52, or
22%), and omission of information about the analysis procedures
used, such as the p value or the name of the statistical test (n =
31, or 13%). The JDRP approval rate for submittals with no
problems in data analysis was considerably higher than that for
submittals having one or more problems: 71% vs. 50%,
respectively.

Effect Sizes 

The mean effect size over all the programs for which this 
statistic could be calculated was 0.89 (n = 263, SD = 1.10). Mean 
effect sizes were also calculated for the different levels of 
categorical variables descriptive of different features of the 
evaluation designs used. Table 12 presents these results, which 
are discussed here. 



Insert Table 12 about here 



Evaluator affiliation. The highest effect sizes were 
obtained by those programs evaluated by independent evaluators 
(M = 0.99), followed by programs evaluated by program staff 
(M = 0.91). Evaluator affiliation accounted for very little of 
the variance in the distribution of effect sizes. 

Instrument type. Programs for which instruments were 
specifically developed to measure program outcomes had mean effect 
sizes almost twice as high as those programs relying on available 
published tests (M = 1.25 and M = 0.67, respectively). Of the 
evaluation characteristics examined for relationships to effect 
size, instrument type was the best single explanatory variable, 
accounting for 12% of the variance in the distribution of obtained 
effect sizes. 

Design type. Designs based on randomized assignment of 
subjects to groups had higher mean effect sizes (M = 1.13) than 
those based on nonrandomized assignment (M = 0.92), and both of 
these had higher mean effect sizes than norm-referenced designs 
(M = 0.59). 

Design quality. In this study, the higher the design 
quality, the higher the mean effect size that was found, ranging 
from 0.93 for high quality designs, to 0.89 for medium, and 0.67 
for low. However, the strength of the relationship between design 
quality and effect size was not great (Pearson r = .09, n = 262), 
and design quality accounted for only a negligible proportion of 
the variance in the effect size distribution. 
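The link between the reported correlation and the "negligible proportion" of variance can be made explicit: for a single predictor, the proportion of variance accounted for is the square of the Pearson correlation. The short Python sketch below works through that arithmetic using the r reported above; the function name is illustrative only.

```python
def variance_explained(r):
    """Proportion of variance accounted for by a single predictor (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2

# r = .09 is the correlation between design quality and effect size reported above.
design_quality_r2 = variance_explained(0.09)
print(f"{design_quality_r2:.4f}")
```

Rounded to two decimal places, this value agrees with the .01 listed for design quality in Table 12.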

Data analysis quality. Programs for which only one or no 
flaws in the data analysis procedures were noted had higher mean 
effect sizes than those programs which had two or three problems 
(M = 0.93 and 0.68, respectively). The extent of problems in the 
data analysis accounted for a very small proportion of the 
variance in effect sizes. 

Effect size formulas. As was described earlier, adjustments 
based on suggestions from the meta-analytic literature were 
sometimes necessary when calculating effect sizes from the data 
reported in the JDRP submittals. When effect sizes were 
calculated from the basic effect size formula (treatment minus 
comparison group post means divided by the standard deviation of 
the comparison group), the mean effect size was considerably 
higher than when effect sizes were calculated using other formulas 
(M = 1.02 for the former, and M = 0.58 for the latter). The use 
of different effect size formulas accounted for 8% of the variance 
in the effect size distribution. 
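The basic effect size formula described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: the function name and the scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form was applied.

```python
from statistics import mean, stdev

def basic_effect_size(treatment_post, comparison_post):
    """Treatment minus comparison post mean, divided by the comparison group SD."""
    # Assumption: sample standard deviation (n - 1 denominator).
    comparison_sd = stdev(comparison_post)
    if comparison_sd == 0:
        raise ValueError("comparison group has zero variance")
    return (mean(treatment_post) - mean(comparison_post)) / comparison_sd

# Hypothetical posttest scores, invented for illustration only.
treatment_scores = [78, 85, 90, 72, 88]
comparison_scores = [70, 75, 80, 65, 85]
example_effect_size = basic_effect_size(treatment_scores, comparison_scores)
print(round(example_effect_size, 2))
```

The adapted formulas mentioned in the text are not reproduced here because the section does not specify them.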

Combining Variables to Explain Effect Size 

Multiple regression analyses were run on the 262 cases for 
which effect sizes had been computed. Effect size was specified 
as the dependent variable and type of objective, derivation of the 
outcome measure, type of evaluation design, evaluator affiliation, 
evaluation quality, quality of the data analysis, and formula for 
estimating effect size (basic vs. one of the adapted formulas) 
were the independent or explanatory variables. Dummy variables 
were created when categorical variables had more than two possible 
levels, resulting in a total of 13 explanatory variables being 
entered into the regression equation. A forward stepwise 
procedure was used, the results of which are presented in Table 
13. 
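The dummy-variable step mentioned above can be illustrated with a small Python sketch. It assumes conventional indicator coding, in which a categorical variable with k levels becomes k - 1 columns of 0s and 1s and one level serves as the reference; the function name, the choice of reference level, and the example values are hypothetical and are not taken from the study's coding scheme.

```python
def dummy_code(values, reference):
    """Return k - 1 indicator columns for a categorical variable with k levels."""
    if reference not in values:
        raise ValueError("reference level does not occur in the data")
    # Every level except the reference gets its own 0/1 column.
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical design types for four cases, for illustration only.
design_types = ["norm-referenced", "quasi-experimental",
                "experimental", "quasi-experimental"]
dummy_columns = dummy_code(design_types, reference="norm-referenced")
for name, column in dummy_columns.items():
    print(name, column)
```

A case coded 0 on every column belongs to the reference level, which is why one column fewer than the number of levels is sufficient.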



Insert Table 13 about here 



The largest single contributor to the explanation of the 
variance in the effect size distribution was that the outcome 
measure was locally developed; the proportion of variance 
accounted for by this factor alone was 11.3%. The variable 
contributing the next largest amount to the proportion of 
explained variance (2.4%) was the type of effect size formula used 
(higher effect sizes were obtained with the basic formula than 
with its variations). Other variables which entered the 
regression equation were presence of an attitudinal objective, 
presence of a behavioral objective, and independent evaluator. Of 
the five variables which satisfied the criteria for entry into the 
regression equation, all were positively related to effect size 
except the presence of an attitudinal objective. The multiple R² 
resulting when these five independent variables were entered into 
the regression equation was 0.17. 
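The relationship between the cumulative proportion of variance explained and the contribution of each variable can be checked with a short Python sketch. The cumulative values are those listed in Table 13; the function and variable names are illustrative only.

```python
# Cumulative proportion of variance explained after each step (Table 13).
cumulative_r2 = [
    ("Local instrument", 0.113),
    ("Basic effect size formula", 0.137),
    ("Attitudinal objective", 0.152),
    ("Behavioral objective", 0.163),
    ("Independent evaluator", 0.173),
]

def r2_increments(steps):
    """Return the increase in explained variance contributed at each step."""
    previous = 0.0
    increments = []
    for name, r2 in steps:
        increments.append((name, round(r2 - previous, 3)))
        previous = r2
    return increments

for name, increase in r2_increments(cumulative_r2):
    print(f"{name}: {increase:.3f}")
```

The first two increments correspond to the 11.3% and 2.4% cited in the text, and the final cumulative value rounds to the 0.17 reported for all five variables.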

Summary and Conclusions 

The effect of the information explosion on the field of 
educational program evaluation has been to put evaluators in the 
embarrassing position, to paraphrase Glass (1976), of knowing less 
than we have proven. The hundreds of JDRP submittals which now 
exist are a perfect example of what Glass means. They describe 
educational programs and evaluation practice in a wide variety of 
content areas and from locations all across the United States. As 
such, they are a rich source of data about how to conduct and 
evaluate educational programs, a source that has gone largely 
untapped (with some exceptions: see Fang, 1981; Hamilton & 
Mitchell, 1979; Har»y, 1978; The Network, 1978). 

In this study, a total of 232 reports describing educational 
programs seeking JDRP approval during the years 1980 through 1983 
were reviewed. The author developed a JDRP submittal analysis 
form to retrieve and document information about (among other 
things) educational program evaluation procedures actually being 
used in classrooms across the country. Found to be typical of 
evaluation practice, as represented by this group of programs, 
were these characteristics: (1) the evaluations were conducted by 
independent evaluators, either alone or in combination with 
program staff; (2) a single measuring instrument was used, usually 
a published test for which content validity and internal 
consistency were reported; (3) a single evaluation design was 
employed, most often being a nonequivalent control group design; 
(4) the quality of the evaluations (the degree to which threats to 
internal validity were controlled or eliminated) was not typically 
high enough to produce convincing evidence that the program 
produced the claimed effects; (5) evidence of replication of 
effects was gathered; (6) descriptive statistics and t-tests were 
used to analyze the data; and (7) the statistical significance of 
the obtained results was assessed. 

Certain features of the evaluation procedures undertaken by 
the educational programs in order to demonstrate effectiveness 
were found more often in programs approved by the JDRP than in 
those not approved. These included: the presence of an 
independent evaluator affiliated with a research firm, the use of 
more than one evaluation design, the absence of obvious errors in 
the data analyses, and the implementation of evaluation designs of 
high quality (in this study, defined as elimination or control of 
threats to internal validity). This latter factor was found to be 
a particularly important consideration to the JDRP, with 79% of 
the evaluations rated as high or very high being implemented in 
programs which were approved by the JDRP, compared with only 32% 
of those rated as low or very low. This finding is consistent 
with the JDRP's stance that the ability of a program to 
demonstrate that observed effects can be attributed to program 
processes is of primary importance in their review process (Fang, 
1981). 
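The 79% and 32% figures cited above can be recovered by pooling adjacent rating categories in Table 9. The Python sketch below shows that arithmetic; the counts are the approved and not-approved frequencies from Table 9, while the function and variable names are illustrative only.

```python
# (approved, not approved) counts of evaluation designs by quality rating, from Table 9.
table9_counts = {
    "very high": (15, 5),
    "high": (111, 28),
    "medium": (93, 42),
    "low": (5, 15),
    "very low": (17, 32),
}

def pooled_approval_rate(ratings):
    """Percentage of designs in the given rating categories that were approved."""
    approved = sum(table9_counts[r][0] for r in ratings)
    total = sum(sum(table9_counts[r]) for r in ratings)
    return 100 * approved / total

high_rate = pooled_approval_rate(["very high", "high"])
low_rate = pooled_approval_rate(["low", "very low"])
print(round(high_rate), round(low_rate))
```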

Of potentially greater importance to the improvement of 
educational program evaluation are the findings relating 
differences in evaluation characteristics to differences in the 
size of effects associated with the educational programs. While 
much has been written about the theoretical and statistical 
benefits and hazards of conducting various types of inquiries 
(see, for example, Boruch & McLaughlin, 1982; Campbell & Boruch, 
1975; Gilbert, Light, & Mosteller, 1975; Kennedy, 1981; Rossi, 
1979), it has been argued that what is really needed for the 
improvement of evaluation is publicly verified evidence of the 
usefulness of applications of evaluation methods in a variety of 
settings (Smith, 1979). Results of this study suggest that 
decisions about how to conduct evaluations or how to assess the 
meaning of program results should reflect an awareness that 
certain characteristics of evaluations may be differentially 
related to the size of effects detected. Illustrations of the 
data on which this conclusion is based follow. 

In this study, the use of locally developed instruments was 
associated with higher effect sizes than the use of published 
tests. Because published tests are designed for maximum 
applicability across a wide range of educational experiences, they 
are more effective at measuring general achievements than specific 
learnings (Bsdl, 1981). As the match between the measuring 
instrument and specific program outcomes improves, other things 
being equal, the size of effects detected will increase. Because 
this is true, programs evaluated with the use of locally developed 
instruments may show larger effect sizes than those evaluated with 
published tests, even if the former programs are actually less 
effective than the latter. 

Other features which were shown to be related to effect size 
were the type and quality of the evaluation design used, with 
higher effect sizes associated with randomized designs and designs 
of high quality. Recall that the magnitude of effect size is 
dependent on two factors: the difference between the treatment 
and comparison groups, and the amount of variance that exists 
within the study. To the extent that the evaluator can reduce 
extraneous variance through increased precision of measuring 
instruments, or through careful planning and implementing of the 
evaluation design, the size of the effects detected will increase, 
other things being equal (HaQl, 1980; Sechrest & Yeaton, 1982). 
While these features are certainly desirable for all evaluation 
research, when they do not exist consistently across a sample of 
educational programs being compared, the interpretation of effect 
size for any given program is confounded with the quality of the 
evaluation design. 
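The point that reduced extraneous variance raises the detected effect size, other things being equal, can be shown with a brief Python sketch. The mean difference and the standard deviations below are hypothetical values chosen only to make the pattern visible; they are not drawn from any submittal.

```python
def effect_size(mean_difference, comparison_sd):
    """Effect size as a mean difference expressed in comparison-group SD units."""
    if comparison_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return mean_difference / comparison_sd

# Hypothetical: the same 5-point difference observed with progressively less variance.
same_difference = 5.0
effect_sizes_by_sd = {sd: effect_size(same_difference, sd) for sd in (10.0, 8.0, 5.0)}
for sd, es in effect_sizes_by_sd.items():
    print(f"SD = {sd}: effect size = {es}")
```

Under these assumed values the identical program effect would appear twice as large in the most precise evaluation as in the least precise one, which is the confounding described above.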

Finally, the results of this study suggest that there is a 
great deal of room for improvement in the quality of educational 
program evaluation being carried out in real-life classroom 
settings. Fewer than half of the evaluations reviewed met the 
criteria for producing reasonably believable evidence of program 
effectiveness. This finding is particularly troublesome because 
programs which apply to the JDRP for validation represent some of 
the finest efforts being made in education in the United States 
today. Continued systematic study of these programs will 
contribute to our understanding of what makes educational programs 
effective, and continued systematic study of the benefits and 
hazards of applying different evaluation methods will help us 
improve the ways we go about assessing educational program 
outcomes. 

Table 1 

JDRP Decisions on Submittals by Year 

          Submittals     Approved      Not approved 
Year          n          n    (%)       n    (%) 

1980          45         31   (69)     14   (31) 
1981          68         43   (63)     25   (37) 
1982          61         35   (57)     26   (43) 
1983          58         35   (60)     23   (40) 

Total        232        144   (62)     88   (38) 






Table 2 

Content Areas Addressed in JDRP Submittals 

                                Programs 
Content area                    n    (%) 

Reading                         86   (26) 
Math                            81   (25) 
Special education               28   (09) 
Career education                26   (08) 
Language arts                   22   (07) 
Natural science                 19   (06) 
Social science                  17   (06) 
Health/Physical education       15   (05) 
Bilingual education              9   (03) 
Gifted education                 7   (02) 
Vocational education             7   (02) 
Writing education                6   (02) 
Teacher education                4   (01) 
Arts/humanities                  3   (01) 
Migrant education                1   (01) 
Other                           14   (04) 

Note. Based on N = 326 programs total. The sum of 
the frequencies exceeds 326, and the sum of the 
percentages exceeds 100 because some programs could 
be classified into more than one category. 



Table 3 

Number of Programs by Educational Level 

                           Programs 
Educational level          n      % 

Preschool                  12      4 
K to grade 3              102     31 
Grades 4 to 6              82     25 
Grades 7 and 8             56     17 
Grades 9 to 12             59     18 
Post secondary             23      7 

Note. Based on N = 326 programs total. The sum of the 
frequencies > 326 and the sum of the percentages > 100 because 
some programs spanned more than one of the above categories. 



Table 4 

Type of Educational Objectives Addressed in JDRP Submittals 

Objective                                    Submittals 
type                                         n     (%) 

Cognitive only                              157    (68) 
Cognitive and behavioral                     30    (13) 
Cognitive and attitudinal                    23    (10) 
Cognitive, behavioral, and attitudinal       11    (05) 
Behavioral only                               4    (02) 
Behavioral and attitudinal                    2    (01) 
Other                                         5    (02) 

Total                                       232   (100) 



Table 5 

Evaluators' Affiliations 

                                  Submittals 
Types of evaluators               n    (%)^a 

Program staff only                18      8 
Independent only                  92     40 
   Academic                       57 
   Research firm                  28 
Staff plus independent            16      7 
"Other" only                      27     12 
Combinations with "other"         23     10 
No information/cannot tell        56     24 

^a Based on N = 232 submittals. ^b Based on n = a total of 125 
evaluators identified as independent. 



Table 6 

Descriptions of Outcome Measures 

                                      Outcome measures 
Feature                               n      %^a 

Type 
   Published                         250     57 
   Locally developed                 129     29 
   Modified                           12      3 
   Other                              15      3 
Administration 
   Norm-referenced                    64     15 
      At norming times                45     70^b 
      Not at norming times            19     30^b 
   Treatment/comparison groups       214     49 
      At same times                  212 
      Not at same times                2      1^c 
Validity information 
   Face                               16      4 
   Content                           233     53 
   Construct                          61     14 
   Criterion                          33      8 
   Other                              60     14 
   No information                     99     23 
Reliability information 
   Stability                          72     16 
   Equivalence                        15      3 
   Internal consistency              166     36 
   Interrater                         33      8 
   Other                             104     24 
   No information                    112     26 

^a Based on N = 438 instruments total. ^b Based on n = 64. ^c Based on 
n = 214. 

Table 7 

Evaluation Designs in the JDRP Submittals 

                                                    Submittals 
Type of design                                      n      % 

Nonrandomized untreated comparison pre-post         87     38 
Nonrandomized untreated comparison post only        57     25 
Nonrandomized alternate treatment pre-post           8      3 
Nonrandomized alternate treatment post only          5      2 
Nonrandomized un- and alternate trt pre-post         5      2 
Nonrandomized multiple time series                   1     <1 
Nonrandomized, other                                 2      1 
National norms pre-post                             61     26 
National norms post only                            21      9 
State/local norms pre-post                           6      3 
Randomized untreated comparison pre-post            30     13 
Randomized untreated comparison post only            9      4 
Randomized alternate treatment pre-post              2      1 
Randomized alternate treatment post only             1     <1 
Randomized un- and alternate trt pre-post            4      2 
Randomized multiple time series                      2      1 
Randomized, other                                    2      1 
One-group pre-post                                  33     14 
One-group post only                                  8      3 
One-group time series                                5      2 
Criterion-referenced                                 1     <1 
One-group, other                                     8      3 
Qualitative                                          1     <1 
Cannot tell                                          4      2 

Note. Percentages based on N = 232 submittals. 

Table 8 

Type of Evaluation Design by JDRP Decision 

Type of                  Submittals       Approved        Not approved 
design                   n    (%)^a       n    (%)^b      n    (%)^b 

Quasi-experimental      170   (73)       111   (65)       59   (35) 
Norm-referenced          88   (38)        59   (67)       29   (33) 
True experimental        50   (22)        33   (66)       17   (34) 
One-group                50   (22)        35   (70)       15   (30) 
Qualitative               1   (01)         1  (100)        0   ( 0) 

Note. Total number of submittals = 232. The sum of the 
frequencies for type of design > 232 because many submittals 
reported more than one evaluation design. 

^a Based on N = 232 submittals. ^b Based on n for type of design. 

Table 9 

Quality of Evaluation Design by JDRP Decision 

Quality          Designs          Approved        Not approved 
rating           n    (%)^a       n    (%)^b      n    (%)^b 

Very high        20   (06)        15   (75)        5   (25) 
High            139   (38)       111   (80)       28   (20) 
Medium          135   (37)        93   (69)       42   (31) 
Low              20   (06)         5   (25)       15   (75) 
Very low         49   (13)        17   (35)       32   (65) 

^a Based on N = 363 total number of designs. ^b Based on n for rating 
category. 



Table 10 

Summary Data on Quality Ratings by Design Type 

                                   Quality rating 
                            Very high/             Very low/ 
                            High        Medium     Low          Mean 
Type of design       n      %           %          %            rating 

One-group            50     18          34         48           2.32 
Norm-referenced      88     38          48         14           3.18 
Quasi-experimental  170     47          37         17           3.22 
True experimental    50     74          20          6           3.84 
Qualitative           1      0         100          0           3.00 

Total^a             359     44          37         19           3.17 

Note. Percentages based on row totals. 

^a Four designs coded as "cannot tell" were eliminated from this 
analysis. 






Table 11 

Data Analysis Methods in JDRP Submittals 

                                Submittals 
Method                          n      % 

Descriptive statistics         232    100 
T-tests                        147     63 
ANOVA                           63     27 
ANCOVA                          70     30 
Regression analyses             12      5 
Nonparametric statistics        58     25 
Qualitative analyses             1     <1 

Note. The sum of the frequencies > 232 and the sum of the 
percentages > 100 because submittals could include more than one 
method. 

Table 12 

Mean Effect Size by Evaluation Design Characteristics 

Characteristic               n       M       SD      R² 

Evaluator 
   Independent only         137     .99     .91 
   Staff only                18     .91     .67 
   Combination               39     .77     .75 
Instrument type                                      .12 
   Published                135     .67     .53 
   Locally developed         93    1.25    1.05 
   Other                     22     .80     .54 
Design type                                          .04 
   Norm-referenced           56     .59     .34 
   Quasi-experimental       162     .92     .85 
   Experimental              40    1.13    1.01 
Design quality                                       .01 
   Low/very low              32     .67     .67 
   Medium                    84     .89     .92 
   High/very high           146     .93     .78 
Data analysis problems                               .02 
   0 or 1                   215     .93     .^^5 
   2 or 3                    47     .68     .56 
Effect size formula                                  .08 
   Regular                  187    1.02     .91 
   Other                     75     .58     .29 

Note. R² = the proportion of variance in the distribution of 
effect sizes accounted for when the evaluation design 
characteristic was considered the sole explanatory variable. 

Table 13 

Results of Stepwise Regression of Selected Variables on Effect 
Size 

        Characteristic                        Increase    Direction 
Step    entered                       R²      in R²       of influence 

1       Local instrument             .113     .113        + 
2       Basic effect size formula    .137     .024        + 
3       Attitudinal objective        .152     .015        - 
4       Behavioral objective         .163     .011        + 
5       Independent evaluator        .173     .010        + 